-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Filter out extra fields, deduplicate fields in ingestion #404
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: zhilingc The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
String.format( | ||
"FeatureRow contains field '%s' which do not exists in FeatureSet '%s' version '%d'. Please check the FeatureRow data.", | ||
field.getName(), featureSet.getReference()); | ||
// skip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we logged statistics in terms of unnecessary fields so that it doesnt result an an actual error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible, but is it possible if i opened a separate PR for this? It would involve introducing a new TupleTag (for warnings)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to use our metrics client here? We wont get information on the specific fields but at least we would know something is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to bubble the error down to the metrics writing fn
@@ -81,6 +82,7 @@ public void processElement(ProcessContext context) { | |||
break; | |||
} | |||
} | |||
fields.add(field); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will a collision be happening here? And if so, how will it be handled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to detect collision at the name level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a collision the 2nd element will not be written. I'm not sure if i want to commit to detecting collision at the name level at this point because it could result in data loss (that would go uncaught by the user)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so if we have two fields with the same name, what happens to the data when it is written to the store? (in the most pathological case)
@zhilingc It's definitely a trade-off if we want to support feature rows that dont exactly comply with the feature set that has been defined, but I do quite like the idea of making Feast a little bit more forward-compatible and not overly brittle. My intuition is that this approach seems sound and that it's the lesser of two evils, especially when we consider our plans w.r.t removing |
/retry |
/retest |
/lgtm |
Fixes #401, but also additionally does some preprocessing of the feature row in the ValidateFeatureRowDoFn to (1) filter out extra fields and (2) deduplicate fields
I'm still not 100% sure if we want to commit to this behaviour.
Pros:
Cons:
That being said deduplication is not being done by field name, but the entire field, so no information should be lost.